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Abstract 

Equilibrium can be represented as an ensemble of uncoupled systems undergoing dynamics in 
which detailed balance is maintained. We describe a simulation protocol based on the "weighted 
ensemble" (WE) approach [Huber and Kim, Biophys. J., 1996], in which weak coupling between 
parallel simulations can capture both dynamical and equilibrium properties without statistical bias. 
The approach is demonstrated on molecular systems. Of particular practical importance, analysis 
of the weighted ensemble does not require states to be chosen in advance, and permits the 
calculation of rates between arbitrary states chosen after the simulation is run. We describe a 
history-dependent non-Markovian formulation to correct for a potentially significant bias in kinetic 
properties measured from equilibrium WE simulations. 

1 Introduction 



Although it is textbook knowledge that the functions of biomacromolecules are strongly coupled 
to their conformational motions and fluctuations [TJ, computer simulation of such motions has been 
a challenge for decades 0. Typically, distinct algorithms are employed to estimate equilibrium 
quantities (e.g., OH]) and dynamical properties (e.g., (5]-[9l)- In principle, a single long dynamics 
trajectory would be sufficient to determine both equilibrium and dynamical properties [TO) , but 
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such simulations remain impractical for most systems of interest. More technical approaches that 
can yield both equilibrium and dynamical simulation, sometimes under minor assumptions, have 
drawn increasing attention [TTI - f1"5l . 

Here, we demonstrate a statistically exact simulation approach which directly embodies equi- 
librium (Fig. Q) via multiple dynamical replicas in detailed balance, but which also yields non- 
equilibrium dynamical information. The method generalizes the "weighted ensemble" (WE) strat- 
egy 15|. WE simulation has previously been used to characterize non-equilibrium dynamical path- 
ways and rates in model molecular systems (e.g., [1"6l - f1"8] ). and for equilibrium simulation |14II19] . 
The rigorous path-sampling formalism underlying WE simulation has been described (20). Very 
recent work pursues WE strategies similar to that proposed below [THQ11 without accounting for 
history dependence in the same fashion. 

2 Theoretical formulation 

WE simulation uses multiple simultaneous trajectories that are occasionally coupled by repli- 
cation or combination events to gain efficiency over simple brute-force simulation [5]. The coupling 
events typically are governed by a static partition of configuration space into "bins" (Fig. QJ;), al- 
though dynamical/adaptive bins may be used |20]. In the case of static bins, when one or more 
trajectories enters an unoccupied bin, those trajectories are replicated to conform to a (typically) 
preset value, M. Similarly, if more than M trajectories are found to occupy a bin, trajectories are 
combined statistically in a pairwise fashion until M remain. As described below, each trajectory is 
assigned a weight (i.e., probability) so that dynamics remain statistically unbiased (20]. Weights 
are reassigned at occasional intervals according to the condition being simulated (e.g., equilibrium 
or a nonequilibrium steady state) [19]. 

The goal of the present work is to reassign weights in a way that achieves equilibrium as rapidly 
as possible (after which rates between arbitrary states can be computed, as described below). Our 
procedure is based on the lack of probability flow among bins during equilibrium: p- q /c° q = pTk^, 
where p° q is the equilibrium probability (sum of trajectory weights) in bin i and k^ is the conditional 
transition probability per unit time (i.e., rate) from bin % to j observed in equilibrium. Early in a 
simulation, however, the lack of an equilibrium distribution implies that measured rates among the 
finite bins will differ from their equilibrium values. Nevertheless, we can assign non-equilibrium bin 
probabilities p based on the transient, non-equilibrium rates k according to the detailed balance 
condition 

pi/pj <- kji/kij . (1) 
Alternatively, steady-state equations, which hold in equilibrium, can be solved for {pj values: 

^Pikij = ^2,Vikij ■ ( 2 ) 

j j 

Adjusting trajectory weights toward their equilibrium values via Eq. 0) or Eq. Q is necessary for 
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Figure 1 : Equilibrium in different representations, (a) Ensemble of trajectories with arrow tips indicating instanta- 
neous configuration and tails showing recent history in the space of two schematic coordinates qi and q 2 - States A and 
B, shown in grey, are two arbitrary regions of phase space, (b) Dissection into two subsets based on whether trajectory 
was most recently in state A (black solid arrows, the "a" steady state) or state B (red dashed, "/3"). (c) Statistically 
equivalent ensemble of weighted trajectories, with arrow thickness suggesting weight. Configuration space has been 
divided into cells ("bins") which each contain an equal number of trajectories. 

the rates k to converge toward k eq . Note that Eq. © tends to have a unique solution, unlike Eq. {1} 
which is overdetermined when rates k are imperfectly estimated. As described below, however, 
the use of Eq. (0} or Eq. ((2) is generally not sufficient to avoid bias in estimating non-equilibrium 
dynamical properties. 

Dynamical quantities can also be calculated from an equilibrium WE simulation, with additional 
analysis. Once equilibrium among trajectories is established in a WE simulation, in the strict sense 
described below, the rates between arbitrary states A and B can be measured. Any region of 
phase space can be used to define a state, e.g., all configurations within a threshold "distance" 
from some reference configuration. Rates can be computed on-the-fly during a WE simulation 
if states are defined in advance, or in a post-analysis of saved trajectories. Several steps are 
required. First, as shown in (Fig. [Tb), the equilibrium set of trajectories is decomposed into two 
steady states: the a steady state consisting of trajectories more recently in A than B, and the (5 
steady state with those most recently in B f9l[2T|: these were denoted "AB" and "BA" steady states, 
respectively, in Ref. [21 j. Trajectories can be "labeled" according to the last state visited — i.e., 
classified as a or j3 — during a WE simulation or in a post-analysis. See also Fig.0 

Once a true equilibrium among trajectories is decomposed into the two steady states, a rate 
can be directly computed from the probability arriving to the final state [9l [1"9ll2"2"l via 

1 _ Flux(A -)■ B\a) 

MFPT(A -> B) ~ rfa) ' () 

where MFPT is the mean-first-passage time, Flux(^4 -> B\a) is the probability per unit time arriving 
to state B in the a steady state andp(a) is the total probability in the a steady state. By construction 
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Figure 2: Constructing a labeled rate matrix for unbiased calculations. For purposes of illustration, here state A 
consists solely of bin 1 and state B solely of bin 2. Left: A traditional rate matrix with history-blind elements. The rate 
hj gives the conditional probability for transitioning from bin i to bin j in a fixed time increment, regardless of previous 
history. Right: The labeled rate matrix accounting for history. The element fcf/ is the conditional probability for the i 
to j transition for trajectories initially in the fx sub-ensemble which transition to the v sub-ensemble, where [i and v are 
either a or p. The labeled rate matrix correctly assigns the a and /3 sub-populations of each bin, whereas the traditional 
matrix may not. 

p(a) + p(/3) = 1. Normalizing by p effectively excludes the reverse steady state and the rate 
calculation only "sees" a uni-directional steady state as in Ref. [T9]. An expression analogous to 
Eq. © applies for the B ->• A MFPT. Also note that the effective first order rate constant, defined 
by Flux(A -4 B\a)/p e 2, can be determined from equilibrium WE simulation because p°^ can be 
directly computed. Eq. [3] is the basis for the "direct" MFPT estimates reported below. 

2.1 Correcting for bias in rate estimation 

It is critical to realize that Eq. ® assumes more than the configurational equilibrium obtained 
by application of Eq. Q} or Eq. ©. In fact, the simple application of Eq. © can lead to significant 
bias in rate estimates in equilibrium implementations of WE, as our data show. The error arises 
because the correct configurational distribution does not guarantee the correct distribution of a 
vs. /3 trajectories as required by Eq. ®. It is not clear whether this bias would be apparent in 
configurationally symmetric systems, but it is clear in the molecular examples studied below. 

It is possible to construct a non-Markovian description that rigorously accounts for the a//3 
labeling of trajectories and yields unbiased rate estimates, even based on bins exhibiting highly 
non-Markovian behavior. The unbiased formalism is built from the observation that, as embodied 
in Eq. (O, the non-equilibrium (and generally non-Markovian) information required for rate estima- 
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tion is solely the a/fi distribution. That is, the required trajectory history is whether a particular 
trajectory originated in A (an a trajectory) or B (a /3 trajectory) — and it is this property that can be 
biased by reweighting based on Eq. {1} or Eq. ([2j. The a (or /3) designation is sufficient because 
then the a or j3 steady state is recovered precisely, leading in turn to the exact rates (9j[T9]. 

To eliminate bias by strictly and separately conforming to the a and /3 steady states, we de- 
compose the equilibrium population into a and /3 components for each bin i: 

Pl q =P?+P^, (4) 

which implies p(a) = Y,iPt ancl PiP) = Y,iPi- We called this a "labeled" analysis. Thus, with 
N bins, a set of 2N probabilities is required rather than N. Similarly, a 2N x 2N rate matrix is 
required: k^, where n and u can be either the a or /3 subsets of trajectories. See Fig. [2) 

Each of the previously considered % rate elements is thus decomposed into four history- 
dependent elements which account for whether the particular trajectory was last in state A or B 
and whether the trajectory transitions between the a and /3 subsets. See Fig. |2j The analysis 
assumes states consist strictly of one or more bins, but this is always possible in a post-analysis 
without loss of generality. Note that more than half the k^ elements are zero. For example, 
consider a bin in the 'intermediate' region (neither A nor B), such as bin 2 in Fig.|2j in this region 
an a trajectory cannot change into a /? trajectory and vice versa; hence rates for these processes 
are zero. Similary, an a trajectory in the intermediate region which enters a bin in B must turn into 
a p trajectory, so the rate will always be zero to the a components of bins in B. 

The non-Markovian results below use the formalism stemming from Eq. |@}. Rates are 
measured in a post-analysis and are analyzed in a Markov-like manner, despite storing poten- 
tially long-time history dependence via the a and (3 labels. Importantly, WE permits the history- 
dependent rates k^ to be measured in a post-analysis of a simulation which was reweighted 
without regard to a/j3 histories, or even a simulation which was not reweighted at all. The re- 
sults below are based on simulations reweighted via Eq. © but post-analyzed using the history 
dependence just described. 

There are several advantages to a formulation using a/f3 labeled rates k^ u . (i) Most importantly, 
bias is removed from rate estimates, (ii) Solution of both the a and (3 steady states is performed 
simultaneously via a standard Markov-state analysis of the k^ rate matrix. By contrast, if the a 
and p steady states are independently solved within a Markov formalism, there can be substantial 
ambiguity in how to assign feedback from the target to initial state when the initial state consists 
of more than one bin. (iii) The labeled formulation guarantees, by construction, the flux balance 
intrinsic to equilibrium, namely, Flux(^ ->• B\a) = Flux(B -> A\(3). (iv) The analysis can be 
performed using arbitrary bins (and states defined as sets of these bins). It is not necessary to 
employ the bins originally used to run the WE simulation because a post-analysis can calculate 
rates among any regions of configuration space. 

In general, a post-analysis evaluation of kinetic properties is particularly powerful in permitting 
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the evaulation of different state definitions. Indeed, for complex systems, it is a major challenge to 
generate reasonable state definitions. In our approach, one can use kinetic properties determined 
in post-analysis of varying state definitions to optimally assign state definitions. 

3 Model systems and simulation details 

Equilibrium weighted ensemble simulations were performed on three systems of varying de- 
grees of complexity: a pair of methane molecules, the small peptide alanine dipeptide (AD) and 
chignolin, a 10 residue mini-protein. All simulations were performed at 300A'. For methane 
molecules, GROMACS united-atom interactions were used with TIP3P explicit solvent. For AD 
and chignolin, the all-atom parm99-SB force-field with implicit GBSA solvent was simulated in AM- 
BER. The default GBSA parameters were used in AMBER version 1 1. The molecular dynamics 
time step used for all systems was At = 2 fs. 

The reweighting procedure requires reliable estimates of the rate matrix in order to obtain 
unique and well-defined steady-state solutions. To achieve this, average rate matrices were used 
during the reweighting process. In this work, rate matrix averaging extended over the second 
half of all available simulation data at the time of the reweighting, although other schemes are 
possible. For all simulations, reweighting was performed after each iteration of the WE algorithm. 
An iteration is defined to be the simultaneous propagation of all trajectories in the ensemble for 
some amount of time, r. In these studies, a value of r = 250At is used. 

For each system, progress coordinates were selected and "binned" in a 1D or 2D space. For 
methane, we used the distance r between the two methane molecules following [T8]. The co- 
ordinate r e [0, 16] A was partitioned with a bin spacing of 1 A. For AD, the two dihedral angles 
cf) e [-180, 180] and V e [-180, 180] form the progress coordinate space. We used a 12 x 12 parti- 
tion of the 2D progress coordinate space so that bins are 30° x 30°. For chignolin, two important 
folding-related distances were binned according to the recommendations in [23]. Specifically, the 
distances X = r(ASP30 - GLY7N) e [0, 12] A and Y = r(GLU50 - THR8N) e [0, 12] A were 
used with a bin spacing of 0.5 A in each dimension. 

To highlight the ability of equilibrium WE to compute both equilibrium and non-equilibrium prop- 
erties, different observables are emphasized for each of the systems. For all systems, the potential 
of mean force (PMF) is calculated via -log(P(q)) where P is the probability distribution function and 
q is the reaction coordinate vector (in units of k B T = 1). Note that since trajectories have different 
weights, summing the weights of all trajectories in a given bin yields the probability of that bin. 
Repeating this for all bins yields P(q). Since the PMFs are calculated in post-analysis, they can 
be measured at a finer resolution than the bin spacing used during the WE simulation. From P(q), 
one can also infer the populations of arbitrary states since states are defined as sets of bins. Im- 
portantly, the PMF was used to aid in the selection of states for the MFPT analysis. The mean 
first passage times between arbitrary states can be measured either directly via Eq. [3] or via the 



6 



labeled rate matrix formulation. 

For "labeled" non-Markovian rate estimation, the following procedure was used: 

1 . Estimate the labeled rate matrix using transitions occuring during WE simulation. 

2. Determine the steady state solution. One way to accomplish this is to exponentiate the 
labeled rate matrix and multiply it by the vector p = (1, 1, 1)/2N until it converges to the 
steady state solution p ss . Not all matrices will remain stable under exponentiation (i.e., their 
elements will approach zero). In such cases, the steady state solution is not well defined and 
measurements of the passage time are skipped. Typically, this only occurs in the beginning 
of simulations since new bins are being occupied for the first time and there are no estimates 
of rates out of said bins. 

3. Use the steady state solution p ss along with the labeled rate matrix elements to calculate 
the a flux entering state B and the (3 flux entering A. 

4. Finally, the MFPT values are given by Eq. [3]with fluxes computed according to 



Flux(A -> B\a) = £ P?k-f Flux(£? -> A\0) = £ pffcg" . (5) 

4 Results 

For AD, populations and MPFTs are measured in both directions between the two states as 
shown in Fig. [3l Here, we compare the results of the non-Markovian analysis with direct flux 
measurements to highlight the bias in the latter. Shown in Fig. |4]are the populations and passage 
times for the WE simulations. For reference, estimates from "brute force" molecular dynamics are 
also shown (black line) along with the 95% confidence estimates in the case of the MFPT (dashed 
black lines). The non-Markovian results are in excellent agreement with brute force estimates and 
well within range of statistical error, yet direct measurements show apparent bias. The reason for 
the large discrepancy is because direct measurement of p a and pp yields highly biased values (i.e. 
they differ substantially from brute force), despite the fact that state populations are reasonably 
well converged, thus biasing the MFPT estimates from Eq. [3l 

In the methane system, the PMF and the MFPT from the unbound to bound state were mea- 
sured as described above. Importantly, the passage times can be measured as function of the 
boundary definition for the unbound state from a single simulation. See the PMF inset of Fig. [5) 
Specifically, the bound state B was held fixed at a separation of 4 A while the definition of the 
unbound state A was varied from 4 to 16 A. The passage time from the bound to unbound state 
was measured in increments of 2 A and is shown in Fig. [5j The result is in excellent agreement 



7 



Tin 



150 
100 
50 

3. 
-50 
-100 
-150 



ectorv Locations 



• . 


• • • 


■ 






* 
• 


* ■ ■ 
* 


* * 


* * 


• 


• \' 


• 
• 








• 


1 


m 




* 

• 


• 


■ v. 




















i 










• 










« 


m 


• 










.* 
• • 










• 

•J * 


• 










■ • 


■ 










• * 


• ■ 




i 




















• • • 
• 


• . : 

* * 


»/ 










• 




m 

* 


t 

• 


' : 


• 


■ 


















■ • »■ 






• 

• 








« 

• 

—+ — 


.» 










• 

• 


* 


• 

• 






• 


' ?i 




* 

• 

• 






• 

• 

■ 




■ 

i 












-* 

* 




1 



-150-100 -50 50 100 150 



Figure 3: A snapshot of the trajectory locations in (<p, tp) space during a WE simulation of alanine- 
dipeptide. The state definitions are indicated by the A and B boxes. Red trajectories (a) were 
those last in state A, and blue trajectories (/?) were those last in state B. 



with [18]. Our variation of the boundary permits a physically reasonable selection near which the 
MFPT shows minimal sensitivity. The MFPT in the reverse direction follows a similar qualitative 
behavior (not shown). 

For chignolin, two independent WE simulations were performed in addition to eight indepen- 
dent brute force simulations for comparison. The two WE simulations were initialized from different 
states: one from the native folded structure and one from an extended structure: see Fig. [6] Many 
of the well defined minima from the 2D PMF (Fig. [6j are in good agreement with those found 
in |23]. To illustrate the quality of sampling achieved by equilibrium WE, Fig. [7] below shows the 
probability distributions of the two chosen coordinates as a function of the simulation progression. 
In sharp contrast, brute force simulations remain trapped near the initial conformation given the 
same amount of total CPU time as the WE simulations (1 ^s). The probability distributions ob- 
tained from 8 independent brute force simulation are shown in Fig. [8] and look similar to those in 
Fig.[7]for the (0%-1%) WE simulation range. 
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Figure 4: Populations and mean first passage times for alanine-dipeptide measured both directly 
and using the non-Markovian formalism. Solid lines indicate estimates from independent "brute 
force" simulations, with dashed lines indicating roughly a 95% confidence interval. The states are 
indicated in Fig. [3l 
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Figure 5: The mean first passage time for methane association (A to B) measured from WE 
simulation as a function of the boundary of state A. The inset displays the PMF along with the 
definitions of the unbound and bound states, indicated by A and B, respectively. 



5 Concluding discussion 

In summary, a generalized equilibrium weighted ensemble (WE) approach has been applied 
to measure equilibrium and kinetic properties in molecular systems. Most notably, it yields equi- 
librium and dynamical information in a single simulation. From a theoretical perspective, it is an 
exact approach and makes no Markovian assumption (20). Because a true ensemble picture of 
equilibrium is simulated, arbitrary states can be chosen for analysis after the simulation is run. 
Not only does this provide great flexibility, but the WE trajectories themselves can guide the state 
definition process, a topic we hope to expand upon in future work. WE is also a parallel method 
characterized by essentially perfect scaling because communication among processors is intermit- 
tent. Although our current implementation employs user pre-defined bins, bins can be adaptively 
chosen (THUS] as we will explore more fully in future work. 

The formulation of a non-Markovian model of highly history-dependent bin populations is par- 
ticularly important. Besides simplifying the calculation of non-Markovian kinetic quantities (i.e., 
rates between arbitrary states), the approach also removes a subtle but important bias that can 
occur when WE simulations are reweighted/resampled solely on the basis of equilibrium observ- 
ables. 
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Figure 6: Two-dimensional PMF obtained from a WE simulation of chignolin as a function of the 
progress coordinates X =r(ASP30-GLY7N) and Y=r(GLU50-THR8N). The PMF was averaged 
over the second half of the simulation data. Simulations were initialized from the native and non- 
native (extended) states as indicated by the diamonds. 
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Figure 7: Evolution of the probability distributions P(X) and P(Y) for two WE simulations of chig- 
nolin. Blue curves are those initialized from the native state and red were initialized from the 
non-native. Shown are the distributions averaged over four periods of the WE simulation. At the 
end, both simulations approach the same distribution. 




X = r(ASP30-GLY7N> (A) Y = r(GLUSO-THR8N) (A) 



Figure 8: Comparison of WE to brute force (BF). The two WE simulations converge to one another 
despite starting in distant basins, whereas equal time BF simulations (8 trials) remained trapped 
near the initial conformation. 
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Appendix: Alternative solution for equilibrium bin populations based 
on minimal uncertainties 

Although the steady-state solution of the rate matrix was used to calculate equilibrium bin pop- 
ulations in the results described above, other procedures are possible (because the equilibrium 
solution is over-determined given imperfect rate estimates from finite data). One alternative is 
to use the pairwise probability ratios from Eq. Q} to form clusters and ultimately to calculate all 
equilibrium populations, as described below. The basic strategy is to join bins (and fix their prob- 
ability ratio) starting from pairs with lowest uncertainty in the corresponding rates — cf. Eq. (TJ. In 
general, separate clusters will be built up and then joined together. 

One concrete procedure employed in our initial studies and in work to be presented elsewhere 
comprises the following series of steps: 

1 . Determine the pair of bins i and j whose estimated population ratio pi/pj has the minimum 
uncertainty among pairs not yet joined. 

2. Join i and j in a cluster. If either or both are already in a cluster, join together whatever 
objects "own" bins i and j. Letting P a and P b denote the probabilities of the clusters owning 
bins i and j, the necessary ratio of cluster probabilities is P a /Pb = (P a /Pi) ipi/Pj) (pj/Pb)- 
This relation applies even if i was previously unclustered by setting P a = pi or analogously 
for j. Note that the fractional probability of a bin in a cluster is determined by the joining 
process. 

3. Repeat from first step until all bins exhibiting any non-zero transition rate with another bin are 
joined into the smallest number of clusters possible. The total probability of a given cluster 
is not changed from the initial sum of probabilities of the constituent bins. 

Once all bins are joined into a single cluster, the fractional bin populations for that cluster are 
the estimated equilibrium populations. 
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